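We introduce CVSS, a massively multilingual-to-English speech-to-speech translation (S2ST) corpus, covering sentence-level parallel S2ST pairs from 21 languages into English. CVSS is derived from the Common Voice speech corpus and the CoVoST 2 speech-to-text translation (ST) corpus by synthesizing the translation text from CoVoST 2 into speech with state-of-the-art TTS systems. Two versions of the translation speech are provided: 1) CVSS-C: all translation speech is in a single high-quality canonical voice; 2) CVSS-T: the translation speech is in voices transferred from the corresponding source speech. In addition, CVSS provides normalized translation text that matches the pronunciation in the translation speech. On each version of CVSS, we built baseline multilingual direct S2ST models and cascade S2ST models, verifying the effectiveness of the corpus. To build strong cascade S2ST baselines, we trained an ST model on CoVoST 2 that outperforms the previous state of the art trained on the corpus without extra data. Nevertheless, the performance of the direct S2ST models approaches the strong cascade baselines when trained from scratch, with only a 0.1 or 0.7 BLEU difference on ASR-transcribed translation when initialized from matching ST models.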
translated by Google Translate
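In this paper we present VDTTS, a visually-driven text-to-speech model. Motivated by dubbing, VDTTS takes video frames as an additional input alongside text and generates speech that matches the video signal. We demonstrate that this allows VDTTS, unlike plain TTS models, to generate speech that not only has prosodic variations such as natural pauses and pitch, but is also synchronized to the input video. Experimentally, we show that our model produces well-synchronized outputs, approaching the video-speech synchronization quality of the ground truth, on several challenging benchmarks, including "in-the-wild" content from VoxCeleb2. We encourage the reader to view the demo videos, which demonstrate video-speech synchronization, robustness to speaker ID swapping, and prosody.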
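We present Translatotron 2, a neural direct speech-to-speech translation model that can be trained end-to-end. Translatotron 2 consists of a speech encoder, a phoneme decoder, a mel-spectrogram synthesizer, and an attention module that connects the other three components. Experimental results suggest that Translatotron 2 outperforms the original Translatotron by a large margin in translation quality and predicted speech naturalness, and drastically improves the robustness of the predicted speech by mitigating over-generation such as babbling or long pauses. We also propose a new method for retaining the source speaker's voice in the translated speech. The trained model is restricted to retaining the source speaker's voice; unlike the original Translatotron, it cannot generate speech in a different speaker's voice. This makes the model more robust for production deployment by mitigating potential misuse for creating spoofed audio artifacts. When the new method is used together with a simple concatenation-based data augmentation, the trained Translatotron 2 model can retain each speaker's voice for input with speaker turns.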
Real-time individual endpoint prediction has always been a challenging task, but one of great clinical utility for both patients and healthcare providers. With 6,879 chronic kidney disease stage 4 (CKD4) patients as a use case, we explored the feasibility and performance of gated recurrent units with decay that model the Weibull probability density function (GRU-D-Weibull) as a semi-parametric longitudinal model for real-time individual endpoint prediction. GRU-D-Weibull has a maximum C-index of 0.77 at 4.3 years of follow-up, compared to 0.68 achieved by competing models. The L1-loss of GRU-D-Weibull is ~66% of XGB(AFT), ~60% of MTLR, and ~30% of the AFT model at the CKD4 index date. The average absolute L1-loss of GRU-D-Weibull is around one year, with a minimum of 40% Parkes serious error after the index date. GRU-D-Weibull is not calibrated and significantly underestimates true survival probability. Feature importance tests indicate that blood pressure becomes increasingly important during follow-up, while eGFR and blood albumin are less important. Most continuous features have a non-linear/parabolic impact on predicted survival time, and the results are generally consistent with existing knowledge. As a semi-parametric temporal model, GRU-D-Weibull shows advantages in built-in parameterization of missingness, native support for asynchronously arriving measurements, the ability to output both probability and point estimates at an arbitrary time point for an arbitrary prediction horizon, and improved discrimination and point-estimate accuracy after incorporating newly arrived data. Further research on its performance with more comprehensive input features and with in-process or post-process calibration is warranted to benefit CKD4 or similarly terminally ill patients.
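To make concrete how a Weibull output can give both a probability and a point estimate, here is a minimal sketch of the standard Weibull formulas. It is not the authors' implementation: the function names and the example `shape` and `scale` values are hypothetical, and it assumes the model emits one shape/scale pair per time step.

```python
import math

def weibull_survival(t, shape, scale):
    """S(t) = exp(-(t / scale) ** shape): probability of surviving beyond time t."""
    if t < 0:
        raise ValueError("t must be non-negative")
    return math.exp(-((t / scale) ** shape))

def weibull_median(shape, scale):
    """Time at which S(t) = 0.5; one possible point estimate of time to endpoint."""
    return scale * math.log(2.0) ** (1.0 / shape)

def conditional_survival(t, horizon, shape, scale):
    """P(T > t + horizon | T > t): survival over a horizon, given survival up to t."""
    return weibull_survival(t + horizon, shape, scale) / weibull_survival(t, shape, scale)

# Hypothetical parameters (scale in years), standing in for one model output.
shape, scale = 1.5, 4.0
print(weibull_survival(2.0, shape, scale))           # probability estimate
print(weibull_median(shape, scale))                  # point estimate
print(conditional_survival(1.0, 2.0, shape, scale))  # arbitrary start time and horizon
```

The median is only one choice of point estimate; the mean or mode of the same distribution could be used instead.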
Humans use all of their senses to accomplish different tasks in everyday activities. In contrast, existing work on robotic manipulation mostly relies on one, or occasionally two modalities, such as vision and touch. In this work, we systematically study how visual, auditory, and tactile perception can jointly help robots to solve complex manipulation tasks. We build a robot system that can see with a camera, hear with a contact microphone, and feel with a vision-based tactile sensor, with all three sensory modalities fused with a self-attention model. Results on two challenging tasks, dense packing and pouring, demonstrate the necessity and power of multisensory perception for robotic manipulation: vision displays the global status of the robot but can often suffer from occlusion, audio provides immediate feedback of key moments that are even invisible, and touch offers precise local geometry for decision making. Leveraging all three modalities, our robotic system significantly outperforms prior methods.
Image analysis technologies empowered by artificial intelligence (AI) have proved images and videos to be an opportune source of data to learn about humpback whale (Megaptera novaeangliae) population sizes and dynamics. With the advent of social media, platforms such as YouTube present an abundance of video data across spatiotemporal contexts documenting humpback whale encounters from users worldwide. In our work, we focus on automating the classification of YouTube videos as relevant or irrelevant based on whether they document a true humpback whale encounter or not via deep learning. We use a CNN-RNN architecture pretrained on the ImageNet dataset for classification of YouTube videos as relevant or irrelevant. We achieve an average 85.7% accuracy, and 84.7% (irrelevant)/ 86.6% (relevant) F1 scores using five-fold cross validation for evaluation on the dataset. We show that deep learning can be used as a time-efficient step to make social media a viable source of image and video data for biodiversity assessments.
Recent work has shown that machine learning (ML) models can be trained to accurately forecast the dynamics of unknown chaotic dynamical systems. Such ML models can be used to produce both short-term predictions of the state evolution and long-term predictions of the statistical patterns of the dynamics ("climate"). Both of these tasks can be accomplished by employing a feedback loop, whereby the model is trained to predict forward one time step, then the trained model is iterated for multiple time steps with its output used as the input. In the absence of mitigating techniques, however, this technique can result in artificially rapid error growth, leading to inaccurate predictions and/or climate instability. In this article, we systematically examine the technique of adding noise to the ML model input during training as a means to promote stability and improve prediction accuracy. Furthermore, we introduce Linearized Multi-Noise Training (LMNT), a regularization technique that deterministically approximates the effect of many small, independent noise realizations added to the model input during training. Our case study uses reservoir computing, a machine-learning method using recurrent neural networks, to predict the spatiotemporal chaotic Kuramoto-Sivashinsky equation. We find that reservoir computers trained with noise or with LMNT produce climate predictions that appear to be indefinitely stable and have a climate very similar to the true system, while reservoir computers trained without regularization are unstable. Compared with other types of regularization that yield stability in some cases, we find that both short-term and climate predictions from reservoir computers trained with noise or with LMNT are substantially more accurate. Finally, we show that the deterministic aspect of our LMNT regularization facilitates fast hyperparameter tuning when compared to training with noise.
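The feedback loop and the input-noise idea described above can be illustrated with a small sketch. This is a toy under stated assumptions, not the paper's reservoir computer or LMNT: `rollout`, `noisy_training_pairs`, and the logistic-map stand-in are hypothetical names chosen for illustration, and no model is actually trained here.

```python
import random

def rollout(step, x0, n_steps):
    """Iterate a one-step predictor in closed loop: each output becomes the next input."""
    states, x = [], x0
    for _ in range(n_steps):
        x = step(x)
        states.append(x)
    return states

def noisy_training_pairs(series, sigma, seed=0):
    """Build (input, target) pairs for one-step training, perturbing only the inputs."""
    rng = random.Random(seed)
    return [(x + rng.gauss(0.0, sigma), y) for x, y in zip(series[:-1], series[1:])]

def logistic(x):
    """Toy chaotic system standing in for the true dynamics."""
    return 3.9 * x * (1.0 - x)

# Generate a trajectory, then form noise-perturbed one-step training pairs from it.
truth = rollout(logistic, 0.2, 50)
pairs = noisy_training_pairs([0.2] + truth, sigma=1e-3)
print(len(pairs), pairs[0])
```

A model fitted on `pairs` would then be passed to `rollout` as `step` to produce multi-step forecasts from its own outputs.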
Human and robot partners increasingly need to work together to perform tasks as a team. Robots designed for such collaboration must reason about how their task-completion strategies interplay with the behavior and skills of their human team members as they coordinate on achieving joint goals. Our goal in this work is to develop a computational framework for robot adaptation to human partners in human-robot team collaborations. We first present an algorithm for autonomously recognizing available task-completion strategies by observing human-human teams performing a collaborative task. By transforming team actions into low-dimensional representations using hidden Markov models, we can identify strategies without prior knowledge. Robot policies are learned on each of the identified strategies to construct a Mixture-of-Experts model that adapts to the task strategies of unseen human partners. We evaluate our model on a collaborative cooking task using an Overcooked simulator. Results of an online user study with 125 participants demonstrate that our framework improves the task performance and collaborative fluency of human-agent teams compared to state-of-the-art reinforcement learning methods.
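This study proposes an end-to-end unsupervised diffeomorphic deformable registration framework based on moving mesh parameterization. Using this parameterization, a deformation field can be modeled by its transformation Jacobian determinant and the curl of the end velocity field. The new model of the deformation field has three important advantages. First, it relaxes the need for an explicit regularization term, and the corresponding weight, in the cost function. Smoothness is implicit in the solution, which results in a physically plausible deformation field. Second, it guarantees diffeomorphism through explicit constraints applied to the transformation Jacobian determinant. Finally, it is suitable for cardiac data processing, since the nature of this parameterization is to define the deformation field in terms of radial and rotational components. The effectiveness of the algorithm is investigated by evaluating the proposed method on three different datasets, including 2D and 3D cardiac MRI scans. The results demonstrate that the proposed framework outperforms existing learning-based and non-learning-based methods while generating diffeomorphic transformations.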
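The role of the Jacobian determinant can be illustrated with a small check: a sampled 2D transformation that folds over itself has a non-positive determinant somewhere. The sketch below is not the moving mesh parameterization itself. It is a plain forward-difference check on a unit grid, with hypothetical function names, and a positive determinant at every sample is only a discrete, local indication of invertibility.

```python
def jacobian_determinants(phi_x, phi_y):
    """Forward-difference Jacobian determinants of a 2D map sampled on a unit grid.

    phi_x[i][j] and phi_y[i][j] hold the mapped x and y coordinates of the grid
    point in row i (y direction) and column j (x direction).
    """
    rows, cols = len(phi_x), len(phi_x[0])
    dets = []
    for i in range(rows - 1):
        row = []
        for j in range(cols - 1):
            dxx = phi_x[i][j + 1] - phi_x[i][j]  # d(phi_x)/dx
            dxy = phi_x[i + 1][j] - phi_x[i][j]  # d(phi_x)/dy
            dyx = phi_y[i][j + 1] - phi_y[i][j]  # d(phi_y)/dx
            dyy = phi_y[i + 1][j] - phi_y[i][j]  # d(phi_y)/dy
            row.append(dxx * dyy - dxy * dyx)
        dets.append(row)
    return dets

def has_positive_jacobian(phi_x, phi_y):
    """True if every sampled determinant is strictly positive (no folding detected)."""
    return all(d > 0 for row in jacobian_determinants(phi_x, phi_y) for d in row)

# Identity map on a 3x3 grid, and a map that mirrors the x axis (a fold).
identity_x = [[float(j) for j in range(3)] for _ in range(3)]
identity_y = [[float(i) for _ in range(3)] for i in range(3)]
mirrored_x = [[-float(j) for j in range(3)] for _ in range(3)]

print(has_positive_jacobian(identity_x, identity_y))
print(has_positive_jacobian(mirrored_x, identity_y))
```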
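There is a growing need to deploy machine learning for different tasks on a wide array of new hardware platforms. Such deployment scenarios require tackling multiple challenges, including identifying a model architecture that can achieve suitable predictive accuracy (architecture search) and finding an efficient implementation of the model that satisfies underlying hardware-specific system constraints such as latency (system optimization search). Existing works treat architecture search and system optimization search as separate problems and solve them sequentially. In this paper, we instead propose to solve these problems jointly, and introduce a simple but effective baseline method called Sonar that interleaves the two search problems. Sonar aims to efficiently optimize for predictive accuracy and inference latency by applying early stopping to both search processes. Our experiments on multiple different hardware back-ends show that Sonar identifies nearly optimal architectures 30 times faster than a brute-force approach.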